Software Vault: The Diamond Collection


Text File | 1994-10-24 | 27KB | 615 lines

GENSRCH REVISION 2.0.1

This document is broken up into several sections:

    INTRODUCTION
    COPYRIGHT
    GENSERV
    DISADVANTAGES
    PROGRAM DESCRIPTIONS
    INSTALLATION
    HOW TO USE WITH A LOCAL COLLECTION OF GEDCOM FILES
    USING WITH DATA ON A CD-ROM
    DEMOS
    WHAT TO EXPECT IN REAL LIFE
    WHAT'S A GEDCOM FILE
    WHAT'S A SOUNDEX
    THE AUTHOR

INTRODUCTION

Gensrch is a set of tools for genealogical research using gedcom files. It lets you search for common ancestors between different gedcom files. If you don't know what a gedcom file is, look at the end under "What's a gedcom file".

The best way to explain it is with a simplified example. I'll leave out some of the setup steps, which are explained later, just to give you the concept. By the way, this example is a true story.

Let's say you belong to a genealogy society (club) and have a collection of gedcom files from many of the people in the club. You want to find out if any of the club members have common ancestors with you, or with each other. After some initial setup, which is done when you add a new gedcom file to your collection, you issue the command:

    gensrch Your_database_name *.ndx

Or, to send the results to a file instead of your screen:

    gensrch Your_database_name *.ndx > results

Your_database_name is typically the name of your own gedcom file, as used when you set up your total database of gedcom files. The results look something like this:

    Search for matches to database johns1
    ==============================================================================
    LAST, First       INDI#  Spouse name       SNDX Birthdate   Deathdate   Database
    ----------------- ------ ----------------- ---- ----------- ----------- -------
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Possible match for Mathis, Frances 495 Coleman, R M320 johns1
    ----------------------------------------
    MATHIS, Frances 1902 COLEMAN, R Sr. M320 20 Feb 1749 1809 coleman2
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Possible match for SINGLETARRY, LYDIA 349 LADD, DANIEL S524 johns1
    ----------------------------------------
    SINGLETERY, Lydia 456 LADD, Daniel S524 30 Apr 1648 pricej1
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Possible match for Thayer, Cicely 394 DAVIS, James T600 28 May 1673 johns1
    ----------------------------------------
    THAYER, Cicely 255 T600 1595 28 May 1673 thayer

My gedcom file's name is johns1.ged, so my database name is also johns1. The report says that in johns1.ged there is a person (Mathis, Frances) who looks to be the same as one in coleman2.ged. I also match people in pricej1.ged and thayer.ged. Once you know this, you can load coleman2.ged in your genealogy program (PAF, ROOTS, BK, etc.) and look to see if the coleman2 database goes back further than yours. I suppose you could even talk to Coleman, but that's rather archaic, don't you think?

Notice that I didn't tell the program to look for Mathis or Thayer. It looked at each individual that came from my gedcom file, and looked for a match against each individual that came from the other gedcom files. It does soundex compares on the names, so exact spelling is not required. It does approximate matching on dates. It understands abbreviations and can match an abbreviated name to a fully spelled name. If you don't know what soundex is, look at the end under "What's a soundex".

COPYRIGHT

Gensrch is copyrighted software. However, you are encouraged to copy and share it. I place no restrictions on its use by non-profit people and organizations. However, if it is used for commercial purposes, I want a piece of the action. It would be nice to break even.

GENSERV

Genserv (not gensrch) is a system on the internet that was started by Cliff Manis. It is his own collection of gedcom files along with utilities to access them. People like you and me can access genserv's database in much the same way as you access your local database.
What's the price? A copy of your gedcom file. That's all. No money. By adding your gedcom file, you add to the value of genserv as a research tool.

The genserv system was the reason my gensrch software was developed, and portions of it have been ported to his genserv machine. Almost everything you can do with my gensrch software, you can do via email with genserv, and with a much larger collection of gedcom files. Genserv, like gensrch, is free, and I would like to encourage anyone with internet access to join the genserv crowd.

At the time this document was written, genserv was being moved to Genserv@GenTech.Org. By the time you read this, it should be up and running again. Send mail to Genserv@GenTech.Org to request material about the server.

DISADVANTAGES

With a large local collection of gedcom files, no matter how you work it, it's a lot of data to wade through, and a slow process. Fortunately you don't have to be there. Go to lunch.

The larger your collection, the more disk space you need for it.

PROGRAM DESCRIPTIONS

All of these programs will give a fairly large help screen if you invoke them with no parameters. All option flags will be displayed.

1. ged2srch.exe

Scans a gedcom file and generates one line of information about each person in it, like this:

    CORLISS, Ann 237 ROBIE, John C642 8 Nov 1657 16 Jun 1691 johns1

The line contains several fields. The first is the person's name, CORLISS, Ann. Next is a number that just indicates when he/she was encountered in the gedcom file. It is sometimes the same as the RIN number used by your genealogy program. The third field is the spouse's name. Next is the soundex code for the person. Next are the birth date and the death date. Finally, the database name.

There are two ways you, as the administrator of this database, can decide the database name. The easiest is to let the gedcom file name decide it with the -g option.
    ged2srch -g *.ged > tmp

This command will scan all the gedcom files in the current directory and generate a one-liner for each person in each file. The database name will be the gedcom file name minus the ".ged". If you don't use the -g option, you must specify a database name:

    ged2srch johns1 c_demo.ged > tmp

This will generate data with the database name johns1.

2. brkmail.exe

Breaks the possibly large file generated by ged2srch into a bunch of smaller files called a.ndx, b.ndx, ... z.ndx, each containing the surnames that start with the same letter as the file name. It's called brkmail because I used to get this information from genserv by email and had to BReaK the MAIL messages up into these files.

3. srtrpt.exe

Sorts the a.ndx ... z.ndx files. Puts them into soundex order and deletes duplicate lines. This makes gensrch run faster, but is not absolutely necessary.

None of the sorts done by these programs has a memory limitation. As long as there is disk space for the necessary temporary files, there should be no problems with large files. Of course, the bigger they are, the longer it takes to sort.

4. gensrch.exe

The final report generator. Searches for matches. There are several options of interest.

You can specify how close dates must match, plus or minus days, months, or years.

You can specify how close the names must match. That one takes some explaining. All names, both first and last, are tested with soundex compares, not string compares. Soundex is a neat thing because it allows slight changes in the spelling (Corliss and Corlisse) to still match. Sometimes, though, it can be too lenient. For example, CROWELL and CURLESS have the same soundex code. The -F x option specifies by how many letters the spelling may differ. I like to use -F 3.

The -M option is nice if you are getting lots of matches. It only shows matches with More than me. In other words, if it finds a match that has dates or spouse names, etc. that your data does not have, it will display this match.
It will not display a match if it appears that you have all the data the other has.

The -g option allows you to take a gedcom file that you haven't added to your ndx files yet and check it against them. For example, your neighbor brings over his gedcom file, and you want to do a quick check to impress him before going through all the steps to add him permanently to your database.

    gensrch -g c_demo.ged *.ndx

For optimization reasons this option is more picky. It will only search an ndx file if its name starts with the same letter as the surnames it is looking for. That is why, for instance, c_demo.ndx is named as it is, starting with a c. That's what brkmail does anyway, so it should be no problem.

One common mistake with the -g option is to use a gedcom file that has already had its data merged into the ndx files. This will result in a zillion matches between the gedcom data and the duplicate data already in the ndx files.

The -g option causes gensrch to create a database name for the new gedcom file in upper case. fred.ged will result in a database name of "FRED", not "fred". Normally the database names in the ndx files are lower case. This is so you won't have to be careful about what the name of your new gedcom file is.

Gensrch will generate one ndi file for each ndx file it searches. This is to make searches run faster. It will remake the ndi file if it is not there, or if it detects, by checking the dates of the two files, that the ndx file has been updated. After running the demo, "gensrch johns1 c_demo.ndx", you will find that a c_demo.ndi file now exists.

Gensrch is being proposed for a non-profit CD-ROM project (Acadian), and one option was added for that environment. The -I (upper case) option makes it ignore the dates for the ndi file. This is just paranoia on my part. I was concerned that the dates might not get installed properly on the CD, and the program would choke each time because it could not do anything about it.

5. combsrch.exe

A pretty printer for gensrch.
Takes a gensrch report like this:

    Possible match for CORLISS, Hildah 11 C642 18 Nov 1661 C_DEMO
    ----------------------------------------
    CORLISS, Hildah 240 C642 18 Nov 1661 johns1
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Possible match for CORLISS, Hildah 11 C642 18 Nov 1661 C_DEMO
    ----------------------------------------
    CORLISS, Hulda 508 KINGSBURY, S C642 18 Nov 1661 1720 pricej1
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Possible match for CORLISS, Hildah 11 C642 18 Nov 1661 C_DEMO
    ----------------------------------------
    CORLISSE, Hulda 239 KINGSBURY, S C642 18 Nov 1661 26 Sep 1698 johns1

That is 3 different matches to the same Hildah Corliss. Combsrch combines the matches so that C_DEMO's person CORLISS, Hildah is mentioned once, like the following:

    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Possible match[s] for CORLISS, Hildah 11 C642 18 Nov 1661 C_DEMO
    ----------------------------------------
    CORLISS, Hildah 240 C642 18 Nov 1661 johns1
    CORLISS, Hulda 508 KINGSBURY, S C642 18 Nov 1661 1720 pricej1
    CORLISSE, Hulda 239 KINGSBURY, S C642 18 Nov 1661 26 Sep 1698 johns1

It can be used to process the output of gensrch in a file, like this:

    gensrch johns1 c_demo.ndx > results
    combsrch results > results2

Or it can be used as a filter, eliminating the two step process, like this:

    gensrch johns1 c_demo.ndx | combsrch > results

6. soundex.exe

Just a utility to echo the soundex code for a name. Example:

    soundex smith
    Soundex for smith is S530

7. deletedb.exe

Scans through ndx files and deletes the specified database. For instance, you could delete all data from johns1, leaving all other data intact. This lets you delete all of the johns1 gedcom data so you can replace it with a new copy without having to recreate the whole database.

8. cleanrpt.exe

Scans files with ndx type data and prints the valid ndx style report lines. This has the effect of stripping out any mail headers, etc. If you are creating all your own data locally, and not getting it from genserv, you won't need this.

9. Surnames.exe

Since I belong to a genealogical society which wants a surname list, I cranked this out to generate a surname list from the gedcom files. It generates a list like the following, which I format into a multi column report with my word processor.

    COOPER 11 johns1
    CORLISS 2 corliss
    CORLISS 1 johns1
    DALTON 5 johns1
    DAVIDSON 6 johns1
    DAVIS 15 corliss
    DAVIS 999 johns1
    DAY 5 johns1

Note that even if johns1 has a hundred Corlisses, the surname will show up in this list only once for johns1. The number is the number of times that surname was encountered, up to a max of 999. I wanted to get a multi column report with my word processor, so I had to put a limit somewhere. Anything over 999 is just lots.

10. c_demo.ged

A demo gedcom file. See the demo section.

11. c_demo.ndx

A demo ndx file. See the demo section.

INSTALLATION

Not much to it. You can put the programs in your current directory and just run them there, or do the following.

Put the programs where you put your other utilities. Under DOS, the command "path" will print out something like this:

    PATH=C:\BIN;C:\DOS;C:\WINWORD;C:\EXCEL;C:\WINDOWS

Each part of that statement, separated by semicolons, is a directory that is searched for programs each time you type a command on the DOS command line. In the above case, if you tried to run the DOS editor by typing the command "edit gensrch.doc", DOS would look for edit in the c:\bin directory, then in the c:\dos directory, where it would finally find it. Any directory included in your path will do fine, although in the above case dos, winword, excel, and windows should be avoided just to keep everything clean. The path is normally defined in your autoexec.bat, and you can add directories if you wish.

The gensrch programs look for the environment variable "TMP" or "TEMP" to find a place to put temporary files.
For example, in your autoexec.bat, either

    Set TMP=C:\tmp

or

    Set TEMP=C:\tmp

sets this variable. Don't forget to create the directory. If you don't have the variable defined, the temporary files will just end up in your current directory. Normally they are deleted when the program exits, but if you Control-C out of a program, they will be left behind.

You can see whether the variable is defined with the DOS command "set". It will dump all the environment variables to the screen. You can browse through them looking for this variable.

HOW TO USE WITH A LOCAL COLLECTION OF GEDCOM FILES

1. Place a copy of all your gedcom files in one directory.

2. ged2srch -g -v *.ged > tmp

   Creates, in the file tmp, the style of reports required by gensrch from the gedcom files.

3. brkmail tmp

   Takes the tmp file and breaks it up into a.ndx, b.ndx, ... z.ndx using the first letter of the surname. You might want to run brkmail in a different directory than the one where you keep your gedcom files, to keep things from getting cluttered.

4. When you have all your index files built:

       gensrch your_database_name *.ndx > matches

   In my case it's:

       gensrch johns1 *.ndx > matches

   The -M "More than me" option will create a much smaller matches file. The -p "Progress" option will send the reports to the screen as well as to the matches file.

5. Take a coffee break :-)

6. Browse through the matches file. Hopefully, it will have found data from other people who are searching your line, and who often have dates, etc. that you don't. Something like this. Note, I am johns1. Looks like pricej1 has some data I am missing. Think I'll send him some mail.
       Search for matches to database johns1
       =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
       Possible match for AYER, Joseph 368 CORLISS, Sarah A600 johns1
       ----------------------------------------
       AYER, Joseph 516 CORLISS, Sarah A600 1660 1710 pricej1
       =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
       Possible match for BROWN, Abigail 329 HARTSHORN, John B650 johns1
       ----------------------------------------
       BROWN, Abigail 575 HARTSHORN, John B650 1694 pricej1

   Delete obvious non-matches.

USING WITH DATA ON A CD-ROM

Some of the features in this release were developed to work with Yvon L. Cyr's Acadian/French Canadian CD-ROM project. This project is basically a collection of gedcom files contributed by anyone who has Acadian/French Canadian ancestors. The gedcom files will be on the CD-ROM. The CD-ROM will probably also carry gensrch results files showing matches between the people who submitted to it. The a.ndx - z.ndx and associated ndi files will be on the CD-ROM as generated by ged2srch, brkmail, and srtrpt.

Those who submitted to the CD-ROM will be able to view their match results with any ascii editor, since the matches will be on the CD-ROM. But what about those of us who didn't get our gedcom files on the CD-ROM? Is it useful to us? Yes! That's where the gensrch -g option comes in. With the gensrch -g option, you can scan for matches between the gedcom file you made at home and all the gedcom files on the CD-ROM.

There are a few problems, and their solutions, that you must be aware of when working with CD-ROM based data. One is that a CD-ROM is slow. Sorry, can't do much about that. Another is that a CD-ROM contains huge amounts of data. Lots of data takes lots of time to scan. Be patient. Go to lunch. Go to bed. Check it in the morning.

The following rules for working with CD-ROM based data also hold true for any write protected data, such as that on a write protected floppy.

The gensrch utilities create temporary files while they are doing their work.
If the environment variable "TEMP" or "TMP" is defined, it tells the programs where to put these temporary files. If neither is defined, the temporary files end up in the current directory. If that current directory happens to be on the CD-ROM, things just won't work, so see the section on INSTALLATION.

There are ways around this problem without messing with "TEMP" or "TMP" if you wish. Let's say, for example, that your CD-ROM drive is g: and your regular hard drive is c:. Let's also say that you are in a directory on your hard drive c: called "george", or anything else you want to call it. From c:\george, you issue the command:

    gensrch -g myged.ged g:*.ndx > results.txt

This searches all the ndx files on the CD-ROM (g:) for matches to your gedcom file on c:, and puts the results in results.txt on c:. Note that in this case your current directory is c:\george, which is not write protected, so temporary files can be created without problems.

With "TEMP" or "TMP" defined properly, you could work from the CD-ROM directly. For example:

    g:
    gensrch -g c:myged.ged *.ndx > c:results.txt

In this case the current directory is on the CD-ROM drive, but "TEMP" tells the program to put the temporary files in a place such as c:\temp. No problem. Note that you had to specify a writable destination for your results.

If you get some sort of error, check to see if you are trying to create files on the CD-ROM, which of course you cannot do.

DEMOS

There are two files included with the package that are only there for demo purposes: c_demo.ndx and c_demo.ged.

c_demo.ndx is the type of data you would get after running ged2srch against your gedcom files, and brkmail against the output of ged2srch. It is an ascii file, so you can look at it with any ascii editor. Of course I cherry picked data that would contain matches, but that's what demos are about.

To try it out, type the command:

    gensrch johns1 c_demo.ndx

It should dump a bunch of matches to the screen.
The same command followed by the redirect-to-file syntax "> results.txt", like this:

    gensrch johns1 c_demo.ndx > results.txt

will put the match data into the file results.txt, which you can print or look at with any ascii editor.

Once you have a large collection of gedcom files merged into ndx files, you might run into the situation where someone brings you their gedcom file, and you want to run a quick check for matches without going through all the ged2srch and brkmail steps. The -g option does this.

    gensrch -p -g c_demo.ged c_demo.ndx > results.txt

First it generates ndx style data from your gedcom file, then it checks this new data against your old ndx files. The -p option sends a second copy of the match information to your screen so you can tell something is happening.

Try this one:

    gensrch -p -g c_demo.ged c_demo.ndx | combsrch > results2.txt

This does the same as the previous gensrch, but runs the output through a "Pretty Printer". Look at the difference between results.txt and results2.txt.

Actually, this -g option was developed to allow searching ndx files that were placed on a CD. You can't add your gedcom data to the CD-ROM's data, so you must use the -g option.

WHAT TO EXPECT IN REAL LIFE

Not much at first! Remember, there are a lot of people out there who are NOT your ancestors. The odds against your neighbor's gedcom file containing the same ancestors as yours are very high. The trick is to collect a lot of gedcom files and reduce the odds. Unfortunately, a lot of gedcom files, and the resultant data generated from them, eat disk space. Also, the bigger the collection, the more time it takes to manage it, so be patient. Matches are out there, and you might hit real pay dirt.

All the demo files contain real matches that I found on the genserv system, which is just a big collection and, other than being big, is no different from the one you are now thinking of gathering.
WHAT'S A GEDCOM FILE

There are a lot of programs now that have the sole purpose of making it easier to maintain genealogy information on your ancestors. One problem with them is that none of them store their data in the same format. How do you get data from your cousin back east, who uses Roots, when you use PAF?

The answer is a gedcom file. All of the better programs will read and write a gedcom file. It's merely an ascii file you can look at with any ascii editor, but it is laid out according to a strict set of rules that most of these programs stick to. You can save all your ancestor information from Roots, or Brothers Keeper, to a gedcom file, and restore it all into PAF, etc.

It is not designed to be used by a database management program to maintain your ancestors' information. It would be extremely slow for that purpose. It's designed to be a way to exchange information.

WHAT'S A SOUNDEX

A soundex code is a way of representing a name that isn't too critical about how the name is spelled. It is an attempt to come up with a code that represents how a name sounds. If two names sound close, in theory they should have the same soundex code. For example, the surname Smith has a soundex code of S530. The surname Smyth has the same soundex code.

Many times in genealogical work, you will find surnames spelled slightly differently between generations, and when the spelling skills of the ancestor were poor, even the ancestor would spell his name several different ways. Soundex helps detect these slight variations, but of course when working with those crazy humans, even soundex isn't enough. A good example is my Moberly ancestors, who switch back and forth between MOBERLY (M164) and MOBLEY (M140). Gensrch will miss these. Sigh.

THE AUTHOR

    John Smith
    28032 Singleleaf
    Mission Viejo
    California USA 92692

    jsmithii@netcom.com
    johns@FileNet.com
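
A NOTE ON HOW SOUNDEX CODES ARE BUILT

To make the idea in WHAT'S A SOUNDEX concrete, here is a minimal Python sketch of the commonly published American Soundex rules. It is an assumption that gensrch's own soundex routine follows these same rules; the function name soundex() is chosen here for illustration and is not taken from the gensrch source. The sample names and codes are the ones quoted earlier in this document.

```python
def soundex(name: str) -> str:
    """Return a 4-character soundex code (standard American rules, assumed)."""
    # Build the letter -> digit table for consonant groups that sound alike.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # code of the previous coded letter
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        digit = codes.get(ch, "")       # vowels and Y have no code
        if digit and digit != prev:
            result += digit             # skip runs of the same code
        prev = digit                    # a vowel resets the run
    return (result + "000")[:4]         # pad or truncate to letter + 3 digits


if __name__ == "__main__":
    for surname in ("Smith", "Smyth", "Corliss", "Moberly", "Mobley"):
        print(surname, soundex(surname))
```

Under these rules Smith and Smyth both come out as S530, while MOBERLY and MOBLEY come out as M164 and M140, which is why a plain soundex compare treats the first pair as the same name and the second pair as different names.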